ateom: avoid unix socket path length limits #100

Open
Benjamin Elder (BenTheElder) wants to merge 1 commit into agent-substrate:main from BenTheElder:socket-len

Conversation

@BenTheElder
Collaborator

@BenTheElder Benjamin Elder (BenTheElder) commented May 28, 2026

Pod names alone can be up to 253 characters, which easily exceeds the 107-byte limit on Unix socket paths.

Add a symlink with the full pod name for debugging.

Follow-up #92

It's a good idea to open an issue first for discussion.

  • Tests pass
  • Appropriate changes to documentation are included in the PR

TODO: I'm not sure about the symlink. It's annoying mapping pod => hash(pod) otherwise though. We could put a file inside the directory with the pod name instead, but I think we should probably put the full pod name somewhere in the structure so it's easier to inspect these.

@a4-a4s1

a4-a4s1 Bot commented May 28, 2026

Skim — hash-with-debug-symlink approach is clean; the 12-char budget derivation in the comment is exactly what I'd want when re-reading this in 6 months.

One observation worth raising before this lands:

  • os.Symlink(ateomDir, debugAteomDir) is not idempotent across restarts. MkdirAll above it is (returns nil if the dir exists); RemoveAll(sockPath) a few lines down is too. But os.Symlink returns EEXIST if the link already exists → on the 2nd start of an ateom for the same (namespace, podName) pair, this line errors and the process exits.
  • Whether it matters depends on the /run/ateom-gvisor lifecycle:
    • per-container tmpfs blown away on container restart → no concern, drop this
    • hostPath / persistent emptyDir surviving container restarts within a pod → restart loop on 2nd start
  • If it's a concern, cheapest fix is os.Remove(debugAteomDir) (ignoring os.IsNotExist) before the Symlink — symmetric with the RemoveAll(sockPath) pattern a few lines down.

Q: is /run/ateom-gvisor's lifecycle scoped to the ateom process, or does it survive an ateom container restart within the same pod?

@BenTheElder
Collaborator Author

> If it's a concern, cheapest fix is os.Remove(debugAteomDir) (ignoring os.IsNotExist) before the Symlink — symmetric with the RemoveAll(sockPath) pattern a few lines down.

Agreed. Done.

This is a host mount, which you can discover from the install manifests.

Comment thread internal/ateompath/ateompath.go Outdated
@a4-a4s1

a4-a4s1 Bot commented May 28, 2026

Confirmed via manifests/ate-install/atelet.yaml: /run/ateom-gvisor is hostPath with DirectoryOrCreate, which means:

  • ateom container restart within the same pod → symlink survives → EEXIST on 2nd os.Symlink → process exits → restart loop.
  • Pod eviction to a different node → fresh path → no issue.

Your os.Remove fix resolves the realistic crashloop path. Thanks for the host-mount pointer — useful context for future ateom-side work.

@BenTheElder
Collaborator Author

Technically, we could also solve this by calling Chdir and binding with a short relative path. I went with this approach instead because the working directory is process-wide: we'd have to be careful to lock to an OS thread without CLONE_FS and isolate it in atelet, or else race on chdir.

Thinking about it more, maybe we want to make the symlink the shortened path and use that just for socket binding.

Comment thread internal/ateompath/ateompath.go Outdated
@@ -41,6 +37,33 @@ func RunSCBinaryPath(sha256 string) string {
}

func AteomPath(ateomNamespace, ateomName string) string {
Collaborator

Can we use the pod UID for the ateom path? That will always fit in the correct length, and is reasonably discoverable.

Collaborator Author

I like it. Will update.

Benjamin Elder (BenTheElder) changed the title from "ateom: hash pod name to avoid unix socket path length limits" to "ateom: avoid unix socket path length limits" on May 28, 2026
@BenTheElder
Collaborator Author

Benjamin Elder (BenTheElder) commented May 28, 2026

EDIT: Squashed down.

The diff is a bit more involved ... but I think this is still better architecturally.

Labels

bug Something isn't working / bugfixes
